Which Distance Metric is Right: An Evolutionary K-Means View

Authors

  • Chuanren Liu
  • Tianming Hu
  • Yong Ge
  • Hui Xiong
Abstract

It is well known that the distance metric plays an important role in the clustering process. Indeed, many clustering problems can be treated as the optimization of a criterion function defined over a distance metric. While many distance metrics have been developed, it is not clear how these metrics affect the clustering/optimization process. To that end, in this paper, we study the impact of a set of popular cosine-based distance metrics on K-means clustering. Specifically, by revealing their common order-preserving property, we first show that K-means produces exactly the same cluster assignment for these metrics during the E-step. Next, through both theoretical and empirical studies, we prove that the cluster centroid is a good approximator of their respective optimal centers in the M-step. As such, we identify a problem with K-means: it cannot differentiate these metrics. To explore the nature of these metrics, we propose an evolutionary K-means framework that integrates K-means and genetic algorithms. This framework not only enables the inspection of arbitrary distance metrics, but can also be used to investigate different formulations of the optimization problem. Finally, this framework is used in extensive experiments on real-world data sets. The results validate our theoretical findings on the characteristics and interrelationships of these metrics. Most importantly, this paper furthers our understanding of the impact of distance metrics on the optimization process of K-means.
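The order-preserving argument for the E-step can be illustrated with a small sketch. The abstract does not name the metrics studied, so the three distances below (`one_minus_cos`, `angular`, `chord`) are hypothetical stand-ins, chosen only because each is a strictly decreasing function of cosine similarity. The data, the seed, and the helper names `cosine` and `assign` are likewise assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical cosine-based distances (assumed for illustration).
# Each maps a cosine similarity c to a distance and is strictly
# decreasing in c, so the ranking of centers is the same for all.
METRICS = {
    "one_minus_cos": lambda c: 1.0 - c,
    "angular": lambda c: math.acos(max(-1.0, min(1.0, c))),
    "chord": lambda c: math.sqrt(max(0.0, 2.0 - 2.0 * c)),
}


def assign(points, centers, dist):
    """E-step: give each point the index of its nearest center."""
    labels = []
    for p in points:
        ds = [dist(cosine(p, c)) for c in centers]
        labels.append(min(range(len(centers)), key=ds.__getitem__))
    return labels


# Synthetic data with a fixed seed (an assumption, not from the paper).
rng = random.Random(0)
points = [[rng.uniform(0.1, 1.0) for _ in range(5)] for _ in range(200)]
centers = [[rng.uniform(0.1, 1.0) for _ in range(5)] for _ in range(4)]

assignments = {name: assign(points, centers, f) for name, f in METRICS.items()}

# True when every metric yields the same label for every point.
all_equal = len({tuple(a) for a in assignments.values()}) == 1
print("identical assignments across metrics:", all_equal)
```

Because the nearest center under any of these distances is simply the center with the largest cosine similarity, the assignment step alone gives K-means no way to tell the metrics apart, which is the limitation the abstract describes.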

Similar resources

Semi-Supervised Composite Kernel Learning Using Distance Metric Learning Techniques

The distance metric has a key role in many machine learning and computer vision algorithms, so choosing an appropriate distance metric has a direct effect on the performance of such algorithms. Recently, distance metric learning using labeled data or other available supervisory information has become a very active research area in machine learning applications. Studies in this area have shown t...

Distance Based Hybrid Approach for Cluster Analysis Using Variants of K-means and Evolutionary Algorithm

Clustering is the process of grouping similar objects into a specified number of clusters. The K-means and K-medoids algorithms are the most popular partitional clustering techniques for large data sets. However, they are sensitive to the random selection of initial centroids and can fall into locally optimal solutions. The K-means++ algorithm has a better convergence rate than other algorithms. Distance metric is used...

An Effective Approach for Robust Metric Learning in the Presence of Label Noise

Many algorithms in machine learning, pattern recognition, and data mining are based on a similarity/distance measure. For example, the kNN classifier and clustering algorithms such as k-means require a similarity/distance function. Also, in Content-Based Information Retrieval (CBIR) systems, we need to rank the retrieved objects based on the similarity to the query. As generic measures such as ...

Algebraic distance in algebraic cone metric spaces and its properties

In this paper, we prove some properties of algebraic cone metric spaces and introduce the notion of algebraic distance in an algebraic cone metric space. As an application, we obtain some famous fixed point results in the framework of this algebraic distance.

Completeness in Probabilistic Metric Spaces

The idea of probabilistic metric space was introduced by Menger and he showed that probabilistic metric spaces are generalizations of metric spaces. Thus, in this paper, we prove some of the important features and theorems and conclusions that are found in metric spaces. At the beginning of this paper, the distance distribution functions are proposed. These functions are essential in defining p...

Publication date: 2012